Process Mining

Process mining is the art of discovering, monitoring, and improving processes by extracting knowledge from event logs available in information systems. Process mining results in knowledge about real processes followed in an organization that might be helpfull when resolving process bottlenecks or when checking if processes are compliant.

bupaR is an R package that allows to perform process mining activities in R. At the end of this page you can find more details about bupaR.

An example of process mining is shown below.
We start loading the necesssary libraries into R and by reading the log file (which is in an excel format for this example):

library(tidyverse)
library(bupaR)
library(openxlsx)
library(processanimateR)
library(heuristicsmineR)
library(petrinetR)
library(lubridate)
library(tictoc)
library(knitr)
library(pheatmap)
library(maditr)
library(viridis)
library(forcats)

# start timer
tic("Time to complete")

# setup section:
filename<-"requisitions.xlsx"

# Read in a fresh file?
read_file<-TRUE

# filter setup


# Read file if required
if (read_file) {
eventlog<-read.xlsx(filename,sheet=1,detectDates=TRUE,colNames=TRUE)

eventlog$TimeStamp<-as.POSIXct(eventlog$TimeStamp, tz = "", format="%d/%m/%y %H:%M:%OS" ,
           tryFormats = c("%Y-%m-%d %H:%M:%OS",
                          "%Y/%m/%d %H:%M:%OS",
                          "%d/%m/%Y %H:%M:%OS",
                          "%Y-%m-%d %H:%M",
                          "%Y/%m/%d %H:%M",
                          "%Y-%m-%d",
                          "%Y/%m/%d"),
           optional = FALSE)
}
eventlog$dow<-weekdays(as.Date(eventlog$TimeStamp))
eventlog<-eventlog[!is.na(eventlog$CaseID),]
eventlog$dow <- ordered(eventlog$dow, levels=c("maandag", "dinsdag", "woensdag", "donderdag", 
 "vrijdag", "zaterdag", "zondag"))


eventlog <- eventlog %>%  
  mutate(dow = fct_recode(dow,
    "Monday" = "maandag",
    "Tuesday" = "dinsdag",
    "Wednesday" = "woensdag",
    "Thursday" = "donderdag",
    "Friday" = "vrijdag",
    "Saturday" = "zaterdag",
    "Sunday" = "zondag"
  )) 

The log file read in looks as follows (first 10 lines shown):

kable(eventlog[1:10,])
CaseID Regel TimeStamp Activity status Approver EnteredBy Year dow
REQ-2094 NA 2010-02-01 08:21:07 Approve complete app001 req001 2010 Monday
REQ-2094 NA 2010-02-01 08:21:09 Routing complete app001 req001 2010 Monday
REQ-2095 NA 2010-02-01 08:21:25 Approve complete app001 req001 2010 Monday
REQ-2095 NA 2010-02-01 08:21:26 Routing complete app001 req001 2010 Monday
REQ-2096 NA 2010-02-01 08:21:34 Approve complete app001 req001 2010 Monday
REQ-2096 NA 2010-02-01 08:21:36 Routing complete app001 req001 2010 Monday
REQ-2097 1 2010-02-01 08:42:17 Create REQ Line complete app002 req002 2010 Monday
REQ-2097 NA 2010-02-01 08:42:18 Create REQ complete app002 req002 2010 Monday
REQ-2097 NA 2010-02-01 08:42:28 Routing complete app002 req002 2010 Monday
REQ-2098 1 2010-02-01 09:26:15 Create REQ Line complete app003 req003 2010 Monday
print(paste("Number of records read in from the log file: ",nrow(eventlog),sep=""))
## [1] "Number of records read in from the log file: 284396"

The log file is converted to an eventlog object:

# convert data frame into an eventlog object
# activity_instance is set equal to the row number

evLog<-eventlog %>%
  mutate(activity_instance = 1:nrow(.)) %>%
  eventlog(
    case_id = "CaseID",
    activity_id = "Activity",
    lifecycle_id = "status",
    activity_instance_id = "activity_instance",
    timestamp = "TimeStamp",
    resource_id = "Approver"
  )

For this example we select only records with a date in 2014 and we keep only complete cycles (from start to end):

# filter on year
evLog<-evLog[evLog$Year==2014,]

# filter on begin and end activities
evLog <- evLog %>%
  filter_endpoints(start_activities = c("Create REQ","Create REQ Line"), end_activities = c("Create PO","Create PO Line"))

print(paste("Number of records remaining after filtering: ",nrow(evLog),sep=""))
## [1] "Number of records remaining after filtering: 45390"

Some statistics about our event log extract:

  • mapping used from log file to event log:
# overview of the mapping used
mapp<-evLog %>% mapping()
kable(as.data.frame(mapp))
case_identifier activity_identifier activity_instance_identifier timestamp_identifier lifecycle_identifier resource_identifier
CaseID Activity activity_instance TimeStamp status Approver
  • activities found in the event log:
# overview of the different activities
mapp<-evLog %>% activity_labels()
kable(as.data.frame(mapp))
mapp
Create PO
Create PO Line
Create REQ Line
Create REQ
Routing
Approve
Change REQ Line
Change REQ
Delete REQ Line
Delete REQ
  • frequency of the activities as found in the event log:
# overview of the mapping used
mapp<-evLog %>% activities()
kable(as.data.frame(mapp))
Activity absolute_frequency relative_frequency
Create REQ Line 13992 0.3082617
Create PO Line 10810 0.2381582
Routing 7045 0.1552104
Approve 4756 0.1047808
Create PO 2921 0.0643534
Create REQ 2348 0.0517295
Change REQ Line 2147 0.0473012
Change REQ 1284 0.0282882
Delete REQ Line 86 0.0018947
Delete REQ 1 0.0000220
  • activity duration:

Note: In below table the durations are all 0 because the log file used does only contain timestamps for the completion of the activities. If both start and end of activities are logged the durations can be calculated. That information can also be used to calculate resource utilisation and resource idle times.

# overview of the mapping used
mapp<-evLog %>% processing_time("activity",units="mins")
kable(as.data.frame(mapp))
Activity min q1 mean median q3 max st_dev iqr total relative_frequency
Create REQ Line 0 0 0 0 0 0 0 0 0 0.3082617
Create PO Line 0 0 0 0 0 0 0 0 0 0.2381582
Routing 0 0 0 0 0 0 0 0 0 0.1552104
Approve 0 0 0 0 0 0 0 0 0 0.1047808
Create PO 0 0 0 0 0 0 0 0 0 0.0643534
Create REQ 0 0 0 0 0 0 0 0 0 0.0517295
Change REQ Line 0 0 0 0 0 0 0 0 0 0.0473012
Change REQ 0 0 0 0 0 0 0 0 0 0.0282882
Delete REQ Line 0 0 0 0 0 0 0 0 0 0.0018947
Delete REQ 0 0 0 0 0 0 NA 0 0 0.0000220
  • Resource activity

The below table is indicating how many of the activities are carried out by each resource.

mapp<-evLog %>% resource_frequency("resource") 
kable(as.data.frame(mapp))
Approver absolute relative
app032 12945 0.2851950
app006 7599 0.1674157
app057 5281 0.1163472
app005 3777 0.0832122
app016 2356 0.0519057
app008 1461 0.0321877
app066 1231 0.0271205
app047 1026 0.0226041
app024 1014 0.0223397
app036 915 0.0201586
app025 893 0.0196739
app009 771 0.0169861
app048 764 0.0168319
app003 595 0.0131086
app037 560 0.0123375
app021 511 0.0112580
app035 390 0.0085922
app033 361 0.0079533
app007 351 0.0077330
app015 335 0.0073805
app045 267 0.0058824
app026 262 0.0057722
app013 228 0.0050231
app063 222 0.0048909
app034 203 0.0044724
app038 196 0.0043181
app011 163 0.0035911
app010 152 0.0033488
app061 129 0.0028420
app058 103 0.0022692
app065 71 0.0015642
app023 63 0.0013880
app044 58 0.0012778
app062 53 0.0011677
app056 42 0.0009253
app030 21 0.0004627
app029 12 0.0002644
app027 8 0.0001763
app052 1 0.0000220
  • Heatmap of activities:

using a heatmap we can check if certain activities are executed more on a specific day in the week:

week_activity <- eventlog %>% 
  group_by(Activity, dow) %>%
  summarise(Freq = n())
week_activity<-week_activity[complete.cases(week_activity),]

p <- ggplot(week_activity, aes(dow, Activity)) + geom_tile(aes(fill = log(Freq)), colour = "white") +
  labs(x="Weekday",y="Activity") +
  scale_fill_viridis()

png(file="heatmap.png",width=1500,height=750)
p
dev.off()
## png 
##   2
include_graphics("heatmap.png")

Dependency Matrix:

In this example we are using the heuristics miner. Other miners available for bupaR are the alpha miner and the inductive miner. (threshold is set to 0.5, covering half of the activities in the log)

# dependency matrix with threshold
dependency_matrix(evLog, threshold = .5) %>% render_dependency_matrix()

Causal net:

# causal net with threshold
causal_net(evLog, threshold = .5) %>% render_causal_net()

Precedence Matrix:

Preceeding activities are found in the rows. The numbers are the number of times an activity in the top row is preceeded by an activity in the first column.

# Efficient precedence matrix
m <- precedence_matrix_absolute(evLog)
kable(as.matrix(m))
Approve Change REQ Change REQ Line Create PO Create PO Line Create REQ Create REQ Line Delete REQ Delete REQ Line End Routing Start
Approve 86 0 1 14 5 0 0 0 0 0 4650 0
Change REQ 216 113 92 6 1 0 54 1 8 0 793 0
Change REQ Line 5 513 1165 0 0 155 277 0 32 0 0 0
Create PO 0 1 0 0 2920 0 0 0 0 0 0 0
Create PO Line 0 0 1 895 7559 0 0 0 0 2355 0 0
Create REQ 0 213 308 56 5 0 151 0 14 0 1601 0
Create REQ Line 0 259 431 0 0 2051 11245 0 6 0 0 0
Delete REQ 0 0 0 0 0 0 1 0 0 0 0 0
Delete REQ Line 0 21 39 0 0 0 9 0 17 0 0 0
End 0 0 0 0 0 0 0 0 0 0 0 0
Routing 4449 164 110 1950 320 0 42 0 9 0 1 0
Start 0 0 0 0 0 142 2213 0 0 0 0 0

Process maps:

(in below maps 90% of the activities are covered, this percentage can be set to avoid showing lesss frequent cases)

  • Relative
evLog <- evLog %>% filter_activity_frequency(percentage=0.9)

evLog %>%
  process_map(type = frequency("relative"), threshold=0.5)
  • Absolute
evLog %>%
  process_map(type = frequency("absolute"), threshold=0.5)
  • Relative - case
evLog %>%
  process_map(type = frequency("relative_case", color_scale = "Purples"), threshold=0.5)
  • Performance (median, days)
evLog %>%
  process_map(performance(median, "days"), threshold=0.5)
  • Performance (mean, hours)
evLog %>%
  process_map(performance(mean, "hours"), threshold=0.5)
  • Performance (edges) and relative frequency (nodes)
evLog %>%
  process_map(type_nodes = frequency("relative_case"),
              type_edges = performance(mean), threshold=0.5)
  • Trace Explorer

In below diagram the coverage was set to 80% of all cases

evLog2<-evLog

evLog2 %>%
  trace_explorer(coverage=0.8) -> traces
png(file="traces.png",width=1000)
traces
dev.off()
## png 
##   2
include_graphics("traces.png")

  • Filtering on trace frequency
evLog<-evLog %>%
  # filter_activity(c("LacticAcid", "CRP", "Leucocytes", "Return ER", "IV Liquid", "IV Antibiotics"), reverse = T) %>%
  filter_trace_frequency(percentage = 0.8)
  • Dotted Chart

The following chart only displays all cases fully contained in a 2 weeks period. It is possible to show cases starting in that one month period, all cases completed in that time interval, or all cases showing some activity iin an4 time interval.

evLog %>%  
  filter_time_period(interval = ymd(c("20140107", "20140121")),filter_method = "contained") %>%
  dotted_chart

  • Animated Log

Using an animated log the cases in the log can be re-played on the process map. The example illustrates this can be very useful when looking for bottlenecks in the process. The process looks simpler in below model as we are only using 90% of the log (based on relative occurrence in the log).

# animate log on process map
animate_process(evLog %>% filter_trace_frequency(percentage = 0.9),
               mode = "relative",
               legend = "CaseId",
               mapping = token_aes(color = token_scale("amount",
                                                       scale = "linear",
                                                       range = c("yellow","red"))))

Stopping the timer:

# stop timer
toc()
## Time to complete: 41.18 sec elapsed

Credits:

citation("processmapR")
## 
## To cite package 'processmapR' in publications use:
## 
##   Gert Janssenswillen (2020). processmapR: Construct Process Maps Using
##   Event Data. R package version 0.3.4.
##   https://CRAN.R-project.org/package=processmapR
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {processmapR: Construct Process Maps Using Event Data},
##     author = {Gert Janssenswillen},
##     year = {2020},
##     note = {R package version 0.3.4},
##     url = {https://CRAN.R-project.org/package=processmapR},
##   }